Introduction to Open Data Science - Course Project

About the project

date()
## [1] "Thu Dec 03 00:02:12 2020"

The text continues here.

What to expect for this course? The truth is, I don’t know. I have some previous background in programming, mostly with Python, but I have never studied data analysis in any way.

For this course, I expect to learn data analytics work flow and R programming.

Link to my GitHub repository for this course: https://github.com/JariRintaaho/IODS-project


Week 2: Regression and model validation

date()
## [1] "Thu Dec 03 00:02:12 2020"
#The data is located in a server. Easyest way to get it, is to use read.csv(url)

url="http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
learning2014=read.csv(url)

str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

The dataset learning2014 consist 166 rows and 7 variables. The structure of the data is shown above. The dataset is handled using data.frame structures in R. The data is related to International survey o Approaches to Learning by Kimmo Vehkalahti.

# ggplot2 is an excellent library for ploting in R. With 3 lines, it is possible to plot practically everything you have ever wanted to see from a datafile.

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

ggpairs(learning2014, title="Practically everything you want to see from a data set", ggplot2::aes(colour=gender), method = c("everything", "pearson")) 
## Warning in warn_if_args_exist(list(...)): Extra arguments: "method" are being
## ignored. If these are meant to be aesthetics, submit them using the 'mapping'
## variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the ggpairs-plot one can see that all the variables, except points, are more or less normally distributed. The gender does not effect dramatically for points. The distributions for males and females are very similar.

# selected explanatory variables for the regression model are: attitude, surf and stra. The target variable is points. 

model = lm(points ~ attitude + surf + stra, data=learning2014) 

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + surf + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# The model seems to fit to the dataset with sufficient accuracy.

The plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage

par(mfrow = c(2,2))
plot(model, which=c(1,2,5))

The linear model seems to work OK.


Week 3: Logistic regression

date()
## [1] "Thu Dec 03 00:02:17 2020"

Part 2: downloading the data

url = "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt" 

data=read.csv(url, sep=",")

The glimpse of the data:

The glimpse is much nicer that the summary(). At the moment, we are not interested in statistical parameters.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(data)
## Rows: 382
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, ...
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "...
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", ...
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,...
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "ser...
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other...
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation...
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "...
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye...
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,...
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,...
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",...
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye...
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes...
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", ...
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", ...
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,...
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, ...
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 1...
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 1...
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, ...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F...

The data seems to have 382 rows and 35 columns. There are several numerical columns, several categorical columns and even one logical column. The data is somehow related to students’ alcohol consumption.

Part 3: Hypotesis

The aim of this task is to analyze alcohol consumption level (logical variable high_use) using four parameters. The interesting parameters are freetime, famrel, goout and absences. It think it is quite obvious that those students how has lot of free time and go out often use more alcohol. According to scientific consensus, family relations affect alcohol consumption. Based on personal experience related to alcohol consumption and studies, high level of alcohol consumption causes absence.

Part 4: Histograms and relations with alcohol consumption

First, let’s draw four histograms. One for each variable

# par makes multiple plots look nicer :)
par(mfrow=c(2,2))
hist(data$freetime)
hist(data$famrel)
hist(data$goout)
hist(data$absences)

It seem that freetime and goout are distributed in more or less Gaussian way. Most of the studest have free time and their going out is average.

The average family relation is skewed. Most of the students have family relation 3 to 5. However, there are still some, whose family relation is 1 or 2.

The absence is also skewed. Nearly all of the students have absence less than 20. However, there seems to be at least one, whose absence is larger than 50.

Let’s then see, how the box plots look like

# ggplot 2 has some nice plot styles.
library(ggplot2)

# par makes multiple plots look nicer :)
par(mfrow=c(2,2))
g1 = ggplot(data, aes(x = high_use, y = freetime))
g1 = g1 + ggtitle("Student freetime by alcohol consumption")
g1 + geom_boxplot() + ylab("freetime")

g2 = ggplot(data, aes(x = high_use, y = famrel))
g2 = g2 + ggtitle("Student famrel by alcohol consumption")
g2 + geom_boxplot() + ylab("famrel")

g3 = ggplot(data, aes(x = high_use, y = goout))
g3 = g3 + ggtitle("Student goout by alcohol consumption")
g3 + geom_boxplot() + ylab("goout")

g4 = ggplot(data, aes(x = high_use, y = absences))
g4 = g4 + ggtitle("Student absences by alcohol consumption")
g4 + geom_boxplot() + ylab("absences")

Based on the findings (the box plots), free time seems to have no effect on alcohol consumption. In both groups (high users and low users) the amouth of free time is identical.

Family relations seems to have effect on alcohol consumption. However, with this parameter there are two outlier points. They are not high users, but they have low scores in family relations.

It seems that going out is linked to being a heavy user of alcohol. Thus, there is one outlier point, who goes out often, but doesn’t use much alcohol.

Absence and being a high user seems to go hand in hand. But, not always. There are lots of outlier points. There are several students, who have plenty of absence, but they are not high users.

Based on the findings, low familiy relations and going out often seems to have strong relations for being a high user of alcohol. Also absence has some explanatory value. Opposite to original hypothesis, the amount of free times seems not to affect on alcohol consumption.

Part 5

The logistic regression model can be built using a single command

m <- glm(high_use ~ freetime + famrel +goout + absences, 
         data = data, family = "binomial")

The model summary. The Pr values for each variable shows clear difference. The smaller the Pr value of a parameter is the better explainer of the parameter is. The rule of thumb is, that if Pr < 0.05, it has value for the model. In this case, goout can be seen as an excellent parameter. Both absences and famrel have some value for the model for explaing the high use of alcohol. The free time seems to have no use for the model.

summary(m)
## 
## Call:
## glm(formula = high_use ~ freetime + famrel + goout + absences, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9033  -0.7761  -0.5096   0.8915   2.3925  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.71107    0.68985  -3.930 8.50e-05 ***
## freetime     0.22413    0.13868   1.616 0.106054    
## famrel      -0.43185    0.13766  -3.137 0.001707 ** 
## goout        0.73875    0.12547   5.888 3.91e-09 ***
## absences     0.05893    0.01611   3.658 0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 385.67  on 377  degrees of freedom
## AIC: 395.67
## 
## Number of Fisher Scoring iterations: 4

The coefficients odds rations and their intervals. The coefficients does not tell much for anyone. But the odds rations and especially their intervals say. If the odds ratio of a parameter is exactly 1, the parameter has no explanatory value for the model. For odds ratio < 1, the parameter has negative correlation and for odds ratio > 1, the correlation is positive.

As one clearly sees, the 95 % confidence boundary for freetime parameter crosses the critical value 1. Therefore, one cannot say, if increased freetime cause high use of alcohol or not.

As expected, family relations have negative correlation. Those studets with high family relations are less likely to be high users of alcohol.

Those students, who go out often, are high users of alcohol, with no doupt. However, the positive correlation between being high user and having lots of absence is only small.

coef(m)
## (Intercept)    freetime      famrel       goout    absences 
## -2.71107006  0.22413401 -0.43185038  0.73875213  0.05893415
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI = confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR      2.5 %    97.5 %
## (Intercept) 0.06646565 0.01651175 0.2485937
## freetime    1.25123868 0.95490180 1.6469720
## famrel      0.64930652 0.49374170 0.8486899
## goout       2.09332169 1.64772445 2.6976795
## absences    1.06070539 1.02929793 1.0978106

Part 6

The predictions for the model can be calculate using predict() function. The predicted values are stored to the same data frame as the actual data is.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
data <- mutate(data, probability = probabilities)

# use the probabilities to make a prediction of high_use
data <- mutate(data, prediction = (probability > 0.5))

The requested 2 x 2 cross tabulation of predictions versus the actual values. It can easily see, that the model prediction errors are not symmetrical. The tends to miss most of the high users and give some false positive predictions.

# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   252   18
##    TRUE     68   44

Week 4: Clustering and classification

date()
## [1] "Thu Dec 03 00:02:18 2020"

Task 1

As you can see, this file have been made and uploaded to GitHub.

# In order to keep the Rmd file clean, all the libraries are downloaded here. Keep scrolling.
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.4     v purrr   0.3.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x MASS::select()  masks dplyr::select()
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Task 2

The data to be analyze this week is called Boston. It comes from MASS package. The data is rather classic set of information. It consist data related to housing values in suburbs of Boston and several other parameters including air quality, crime rate and proportion of people of color at those suburbs. The dataset is rather old. It is first published in 1978.

data("Boston")

The str() method is called for checking the basic structure of the data. The data has 506 rows and 14 columns. The variable names and types are listed here:

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Task 3

Here is the pairs-plot of the dataset. Since the data has quite many columns, it practically shows you nothing.

pairs(Boston)

It is better to use corrplot()-method for checking the data set. The tidyverse and corrplot packages were downloaded for this.

The correlations between different parameters are shown here. The larger and darker the circle is, the stronger the correlation is. Blue represents positive and red negative correlation.

cor_matrix<-cor(Boston) %>% round(digits=2)

corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The histograms for each variable can be drawn using ggplot2. Traditionally, the data set is used for demonstrating effects of different parameters to housing values.

ggplot(gather(Boston), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x')

Task 4

In this task, the data set “Boston” is standardized and crime rate is switched to categorical variable.

As one can see, the mean value for each parameter is zero after scaling.

## First, the data set is scaled and summary is printed
boston_scaled <- scale(Boston)
boston_scaled=as.data.frame(boston_scaled)

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# Then the crime rate is switched to categorical variable.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

The final part of this task is to split the scaled data set into test and train data using 80-20 split.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set 
test <- boston_scaled[-ind,]

After the model is trained, the results it gives to the test set is less-biased estimator for the actual model accuracy.

Task 5

First, the LDA-model is fitted to the data set. This time, the catergorical crime rate is used as target variable.

# These simple models can be applied using a single line of code.
lda.fit <- lda(crime ~ ., data = train)

A new function “lda-arrows” is defined for the task. The function draws the LDA bi-plot.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

One can see, that is actually relatively easy to separate the high-crime rate areas

Task 6

First, the correct classes are separated from the test set

# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)

The rest of the test data is fed into the trained LDA model. The cross tabulated results are printed.

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       4        2    0
##   med_low    7      10        8    0
##   med_high   1       9       18    1
##   high       0       0        1   26

As the results show, the model can predict the high crime rate suburbs with nearly 100% accuracy. However, the model truly struggles when it tries to decide, should a suburb belong to low or med_low category. In addition, there is some problems to find the difference between med_low and med_high suburbs.

Task 7

First, let’s reload and standardize the Boston data set

data("Boston")
boston_scaled <- scale(Boston)
boston_scaled=as.data.frame(boston_scaled)

Second, let’s build a k-mean clustering algorithm with the scaled boston data.

Let’s select 2 as the number of clusters. Since 14x14 pair plot is bit difficult to look at. Let’s draw three smaller ones.

k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering
km <-kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled[1:5], col = km$cluster)

pairs(boston_scaled[5:10], col = km$cluster)

pairs(boston_scaled[10:14], col = km$cluster)

The pair plots shows nice separation of two groups. The results are quite simillar as with the LDA algorithm. With LDA we saw that it is relatively easy to find the suburbs with high crime rate. But there was significant overlapping with suburbs that had low, med_low and med_high crime rate.

Bonus

First, let’s perform the k-means clustering using 3 centers.

km2 <-kmeans(boston_scaled, centers = 3)

Then, let’s try to fit the LDA to the new k-means model output.

# These simple models can be applied using a single line of code.
lda.fit2 <- lda(km2$cluster ~ ., data = boston_scaled)

Now, let’s look the results. It seems that the model can predict detect the suburbs with high crime rate using this unsupervised learning method.

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km2$cluster)

# plot the lda results
plot(lda.fit2, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Super bonus

First part of the super bonus task

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)


plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Week 5 - Dimensionality reduction techniques

This first chunk is just for downloading data and libraries.

date()
## [1] "Thu Dec 03 00:02:25 2020"
url="http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt"

human=read.csv(url)

library(corrplot)
library(tidyverse)
library(ggplot2)

Task 1

The data combines several indicators from most countries in the world

  • “Country” = Country name

Health and knowledge

  • “GNI” = Gross National Income per capita
  • “Life.Exp” = Life expectancy at birth
  • “Edu.Exp” = Expected years of schooling
  • “Mat.Mor” = Maternal mortality ratio
  • “Ado.Birth” = Adolescent birth rate

Empowerment

  • “Parli.F” = Percetange of female representatives in parliament

  • “Edu2.F” = Proportion of females with at least secondary education

  • “Edu2.M” = Proportion of males with at least secondary education

  • “Labo.F” = Proportion of females in the labour force

  • “Labo.M” " Proportion of males in the labour force

  • “Edu2.FM” = Edu2.F / Edu2.M

  • “Labo.FM” = Labo2.F / Labo2.M

Summary of each individual variable:

print(summary(human))
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Histograms of different variables in the data set.

human %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Corrplot is nice way to inspect linear correlation coefficients between variables in a dataset.

cor_matrix<-cor(human) %>% round(digits=2)

corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

There are several parameters, that seems to correlate rather nicely between each other. For example it is rather obvious, that high life expectancy correlates negatively with maternal mortality rate. One thing was not so obvious, but still interesting. Proportion of females in the labour force does not correlate with anything.

Task 2

PCA model and the biplot for non-standardized data.

## PCA model and biplot
pca_human <- prcomp(human)

s <- summary(pca_human)

# rounded percetanges of variance captured by each PC
pca_pr <- round(1*s$importance[2, ]*100, digits = 1)

# create object pc_lab to be used as axis labels
pc_lab=paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.5, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

It can be clearly seen, that GNI is only explaining variable for non-standardized data.

Task 3

Same model, but for scaled data.

human_std <- scale(human)

## PCA model and biplot
pca_human <- prcomp(human_std)

s <- summary(pca_human)

# rounded percetanges of variance captured by each PC
pca_pr <- round(1*s$importance[2, ]*100, digits = 1)

# create object pc_lab to be used as axis labels
pc_lab=paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.5, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

From the data, three different groups of variable can be detected. The first group: In general, Proportion of females in the labour force (“Labo.F”) and Percetange of female representatives in parliament (“Parli.F”) seems to correlate positively. The explanation is more or less obvious. If females participate labour force, the they also participate politics.

Second group consists two variables: Maternal mortality ratio (“Mat.Mor”) and Adolescent birth rate (“Ado.Birth”). This is also rather obvious positive correlation. Maternal mortality ratio is often correlated with income level.

The third group consists last four parameters: “GNI” = Gross National Income per capita, “Life.Exp” = Life expectancy at birth, “Edu.Exp” = Expected years of schooling and Edu2.FM = Proportion of females with at least secondary education / Proportion of males with at least secondary education. These third group parameters correlate negatively with the second group.

Task 4

First parameter is a linear combination of six parameters: GNI, Life.Exp, Edu.Exp, Edu2.FM, Mat.Mor and Ado.Birth. This parameter could be called for example “Economical index”.

Second PC parameter is a linear combination of Labo.FM and Parli.F. This parameter can be called for example “Equality index”

Task 5

Let’s first download the data and look the structure

library(FactoMineR) 
data(tea)

print(str(tea))
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## NULL

The dataset has 300 rows and 36 variables. For shake of simplicity, let’s narrow down the data set by removing most of the variables

keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- subset(tea, select=keep_columns)

print(summary(tea_time))
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 

Let’s visualize the data set.

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## List of 1
##  $ axis.text.x:List of 11
##   ..$ family       : NULL
##   ..$ face         : NULL
##   ..$ colour       : NULL
##   ..$ size         : num 8
##   ..$ hjust        : num 1
##   ..$ vjust        : NULL
##   ..$ angle        : num 45
##   ..$ lineheight   : NULL
##   ..$ margin       : NULL
##   ..$ debug        : NULL
##   ..$ inherit.blank: logi FALSE
##   ..- attr(*, "class")= chr [1:2] "element_text" "element"
##  - attr(*, "class")= chr [1:2] "theme" "gg"
##  - attr(*, "complete")= logi FALSE
##  - attr(*, "validate")= logi TRUE

Data shows that: People prefer Earl Grey tea and has no lear opinion about sugar.

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model


# visualize MCA
plot(mca, invisible=c("ind"))

Data clearly shows, that we have a strong hipster cluster. The people, that drinks unpacked tea buys that from tea shops.

Week 6 - Analysis of longitudinal data

This first chunk is just for libraries.

date()
## [1] "Thu Dec 03 00:02:27 2020"
library(corrplot)
library(tidyverse)
library(ggplot2)
library(tidyr)
library(dplyr)

Task 1. Analysing RATS data

First, let’s load RATS data set and look its structure

url_RATS="https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt"

RATS=read.csv(url_RATS, sep="\t")

RATS$ID=as.factor(RATS$ID)
RATS$Group=as.factor(RATS$Group)


glimpse(RATS)
## Rows: 16
## Columns: 13
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1   <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 4...
## $ WD8   <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 4...
## $ WD15  <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 4...
## $ WD22  <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 4...
## $ WD29  <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 4...
## $ WD36  <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 4...
## $ WD43  <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 4...
## $ WD44  <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 5...
## $ WD50  <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 5...
## $ WD57  <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 5...
## $ WD64  <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 5...
head(RATS)
##   ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1  1     1 240 250  255  260  262  258  266  266  265  272  278
## 2  2     1 225 230  230  232  240  240  243  244  238  247  245
## 3  3     1 245 250  250  255  262  265  267  267  264  268  269
## 4  4     1 260 255  255  265  265  268  270  272  274  273  275
## 5  5     1 255 260  255  270  270  273  274  273  276  278  280
## 6  6     1 260 265  270  275  275  277  278  278  284  279  281
tail(RATS)
##    ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 11 11     2 445 445  450  452  455  455  451  450  462  466  472
## 12 12     2 555 560  565  580  590  597  595  595  612  618  628
## 13 13     3 470 465  475  485  487  493  493  504  507  518  525
## 14 14     3 535 525  530  533  535  540  525  530  543  544  559
## 15 15     3 520 525  530  540  543  546  538  544  553  555  548
## 16 16     3 510 510  520  515  530  538  535  542  550  553  569

The RATS file has 16 time series with 11 time series each. The time serieses can be spitted ont three different groups.

Now, let’s convert the RATS into “long format”

RATSL <- RATS %>%
  gather(key = WD, value = Weight, -ID, -Group) %>%
  mutate(Time = as.integer(substr(WD,3,4))) 

str(RATSL)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : chr  "WD1" "WD1" "WD1" "WD1" ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,...
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, ...
## $ WD     <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, ...
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, ...
head(RATSL)
##   ID Group  WD Weight Time
## 1  1     1 WD1    240    1
## 2  2     1 WD1    225    1
## 3  3     1 WD1    245    1
## 4  4     1 WD1    260    1
## 5  5     1 WD1    255    1
## 6  6     1 WD1    260    1
tail(RATSL)
##     ID Group   WD Weight Time
## 171 11     2 WD64    472   64
## 172 12     2 WD64    628   64
## 173 13     3 WD64    525   64
## 174 14     3 WD64    559   64
## 175 15     3 WD64    548   64
## 176 16     3 WD64    569   64

Let’s draw each individual time series.

p1 <- ggplot(RATSL, aes(x = Time, y = Weight, group = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
p6

It can be seen, that group 1 is clearly separated data set. Group 2 has one clear out lier data series. Groups 2 and 3 seems not to differ much.

Let’s standardize the data sets. And then plot them.

# Standardise the scores:
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate( stdWeight = (Weight - mean(Weight))/sd(Weight) ) %>%
  ungroup()
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1,...
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, ...
## $ WD        <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "...
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 55...
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, ...
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.881900...
p1 <- ggplot(RATSL, aes(x = Time, y = stdWeight, group = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(name = "standardized Weight")
p6

When standardized the limes looks more or less the same as without standardization.

Let’s then look the box plots.

p1 <- ggplot(RATSL, aes(x = factor(Time), y = Weight, fill = Group))
p2 <- p1 + geom_boxplot(position = position_dodge(width = 0.9))
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + theme(legend.position = c(0.8,0.8))
p5 <- p4 + scale_x_discrete(name = "week")
# Black & White version:
#p6 <- p5 + scale_fill_grey(start = 0.5, end = 1)
p5

With the box plots, the difference between groups 2 and 3 comes clear. However, each of the groups have clearsly outliers.

Let’s look the mean values for each group.

RATSL8S <- RATSL %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
## `summarise()` regrouping output by 'Group' (override with `.groups` argument)
p1 <- ggplot(RATSL8S, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
## Warning: `fun.y` is deprecated. Use `fun` instead.
p5 <- p4 + scale_y_continuous(name = "mean(Weight)")
p5

The most obvious outlier is for Group 2. Let’s remove it.

# Remove the outlier:
RATSL8S1 <- RATSL8S %>%
  filter(mean < 550)
p1 <- ggplot(RATSL8S1, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
## Warning: `fun.y` is deprecated. Use `fun` instead.
p5 <- p4 + scale_y_continuous(name = "mean(Weight), weeks 1-8")
p5

Now, let’s test, how good linear fitting we can do?

# Add the baseline from the original data as a new variable to the summary data:
baseline <- RATS$WD1
RATSL8S2 <- RATSL8S %>%
  mutate(baseline)
glimpse(RATSL8S2)
## Rows: 16
## Columns: 4
## $ Group    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID       <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean     <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273...
## $ baseline <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555...
# Fit the ANCOVA model and see the results:
fit <- lm(mean ~ baseline + Group, data = RATSL8S2)
summary(fit)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.732  -3.812   1.991   6.889  13.455 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.14886   19.88779   1.516   0.1554    
## baseline     0.93194    0.07793  11.959 5.02e-08 ***
## Group2      31.68866   17.11189   1.852   0.0888 .  
## Group3      21.52296   21.13931   1.018   0.3287    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.62 on 12 degrees of freedom
## Multiple R-squared:  0.9947, Adjusted R-squared:  0.9933 
## F-statistic: 747.8 on 3 and 12 DF,  p-value: 6.636e-14
anova(fit)
## Analysis of Variance Table
## 
## Response: mean
##           Df Sum Sq Mean Sq   F value    Pr(>F)    
## baseline   1 252125  252125 2237.0655 5.217e-15 ***
## Group      2    726     363    3.2219   0.07586 .  
## Residuals 12   1352     113                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It seems that we can separate the different groups even when the baseline is taken into account.

Task 2. Analyzing the BPRS Data

First, let’s load the dataset and modify it a bit.

url="https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt"

BPRS=read.csv(url, sep=" ")

BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

# Convert to long form
BPRSL <-  BPRS %>% gather(key = weeks, value = bprs, -treatment, -subject)

# Extract the week number
BPRSL <-  BPRSL %>% mutate(week = as.integer(substr(weeks,5,5)))

The dataset BRPS consists 40 rows and 11 variables. Two of the variables are related to identifying and nine are the actual time series.

glimpse(BPRSL)
## Rows: 360
## Columns: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ weeks     <chr> "week0", "week0", "week0", "week0", "week0", "week0", "we...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 6...
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ treatment, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
p6

BPRSL_reg <- lm(bprs ~ week + treatment, data = BPRSL)
summary(BPRSL_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16
# dummies (in Table) vs summary output: D1 = Group2, D2 = Group3
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 10))
p4 <- p3 + scale_y_continuous(name = "Weight (grams)")
p5 <- p4 + theme_bw() + theme(legend.position = "top")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6

pairs(BPRS, cex = 0.7)

library("lme4")
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
BPRSL_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRSL_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.437       
## treatment2 -0.282  0.000
# dummies (in Table) vs summary output: D1 = Group2, D2 = Group3
BPRSL_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRSL_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# dummies (in Table) vs summary output: D1 = Group2, D2 = Group3
anova(BPRSL_ref1, BPRSL_ref)
## Data: BPRSL
## Models:
## BPRSL_ref: bprs ~ week + treatment + (1 | subject)
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
##            npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)  
## BPRSL_ref     5 2748.7 2768.1 -1369.4   2738.7                       
## BPRSL_ref1    7 2745.4 2772.6 -1365.7   2731.4 7.2721  2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
BPRSL_ref2 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRSL_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9802   -0.51
##  Residual             97.4305  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# dummies (in Table) vs summary output: D1 = Group2, D2 = Group3
anova(BPRSL_ref1, BPRSL_ref2)
## Data: BPRSL
## Models:
## BPRSL_ref1: bprs ~ week + treatment + (week | subject)
## BPRSL_ref2: bprs ~ week + treatment + (week | subject)
##            npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## BPRSL_ref1    7 2745.4 2772.6 -1365.7   2731.4                    
## BPRSL_ref2    7 2745.4 2772.6 -1365.7   2731.4     0  0
Fitted <- fitted(BPRSL_ref2)
BPRSL <- BPRSL %>% mutate(Fitted)
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20))
p4 <- p3 + scale_y_continuous(name = "Weight (grams)")
p5 <- p4 + theme_bw() + theme(legend.position = "right") # "none" in the book
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Observed")
graph1 <- p7

p1 <- ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + scale_x_continuous(name = "Time (days)", breaks = seq(0, 60, 20))
p4 <- p3 + scale_y_continuous(name = "Weight (grams)")
p5 <- p4 + theme_bw() + theme(legend.position = "right")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Fitted")
graph2 <- p7
graph1; graph2

That’s All Folks